Results 1 - 5 of 5
1.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38588559

ABSTRACT

MOTIVATION: Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence-to-regulatory-function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models of suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited. RESULTS: Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics. AVAILABILITY AND IMPLEMENTATION: The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.
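
The idea of phylogenetic augmentation described in this abstract can be illustrated with a minimal sketch. It assumes that each training sequence may have a list of homologous sequences from other species and that a homolog inherits the label of the sequence it replaces. The function names, the `p_replace` parameter, and the toy data are hypothetical and are not taken from the authors' repository, which may implement the method differently.

```python
import random


def phylo_augment(seq_id, sequences, homologs, rng, p_replace=0.5):
    """Return the original sequence or, with probability p_replace,
    a randomly chosen homolog from another species (hypothetical helper)."""
    candidates = homologs.get(seq_id, [])
    if candidates and rng.random() < p_replace:
        return rng.choice(candidates)
    return sequences[seq_id]


def augmented_batch(batch_ids, sequences, labels, homologs, rng, p_replace=0.5):
    """Build one training batch; each label stays tied to the original
    sequence ID even when a homolog is substituted (an assumption here)."""
    return [
        (phylo_augment(i, sequences, homologs, rng, p_replace), labels[i])
        for i in batch_ids
    ]


# Toy data, invented for illustration only.
sequences = {"enh1": "ACGTACGT", "enh2": "TTGACCAA"}
labels = {"enh1": 1.7, "enh2": 0.2}
homologs = {"enh1": ["ACGTACGA", "ACGAACGT"]}  # enh2 has no known homologs

rng = random.Random(0)
batch = augmented_batch(["enh1", "enh2"], sequences, labels, homologs, rng)
for seq, label in batch:
    print(seq, label)
```

Sampling a fresh homolog each time a batch is built exposes the model to different sequence variants across epochs without adding new labeled measurements; sequences without homologs simply pass through unchanged.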


Subjects
Deep Learning , Genomics , Phylogeny , Genomics/methods , Supervised Machine Learning , Humans
2.
Genome Res ; 31(4): 564-575, 2021 04.
Article in English | MEDLINE | ID: mdl-33712417

ABSTRACT

Transcriptional enhancers are critical for development and phenotype evolution and are often mutated in disease contexts; however, even in well-studied cell types, the sequence code conferring enhancer activity remains unknown. To examine the enhancer regulatory code for pluripotent stem cells, we identified genomic regions with conserved binding of multiple transcription factors in mouse and human embryonic stem cells (ESCs). Examination of these regions revealed that they contain on average 12.6 conserved transcription factor binding site (TFBS) sequences. Enriched TFBSs are a diverse repertoire of 70 different sequences representing the binding sequences of both known and novel ESC regulators. Using a diverse set of TFBSs from this repertoire was sufficient to construct short synthetic enhancers with activity comparable to native enhancers. Site-directed mutagenesis of conserved TFBSs in endogenous enhancers or TFBS deletion from synthetic sequences revealed a requirement for 10 or more different TFBSs. Furthermore, specific TFBSs, including the POU5F1:SOX2 comotif, are dispensable, despite cobinding the POU5F1 (also known as OCT4), SOX2, and NANOG master regulators of pluripotency. These findings reveal that a TFBS sequence diversity threshold overrides the need for optimized regulatory grammar and individual TFBSs that recruit specific master regulators.


Subjects
Embryonic Stem Cells/metabolism , Enhancer Elements, Genetic , Transcription Factors/metabolism , Animals , Binding Sites , Humans , Mice , Pluripotent Stem Cells/metabolism
3.
F1000Res ; 6: 52, 2017.
Article in English | MEDLINE | ID: mdl-28344774

ABSTRACT

As genomic datasets continue to grow, the feasibility of downloading data to a local organization and running analysis on a traditional compute environment is becoming increasingly problematic. Current large-scale projects, such as the ICGC PanCancer Analysis of Whole Genomes (PCAWG), the Data Platform for the U.S. Precision Medicine Initiative, and the NIH Big Data to Knowledge Center for Translational Genomics, are using cloud-based infrastructure to both host and perform analysis across large datasets. In PCAWG, over 5,800 whole human genomes were aligned and variant-called across 14 cloud and HPC environments; the processed data was then made available on the cloud for further analysis and sharing. If run locally, an operation at this scale would have monopolized a typical academic data centre for many months, and would have presented major challenges for data storage and distribution. However, this scale is increasingly typical for genomics projects and necessitates a rethink of how analytical tools are packaged and moved to the data. For PCAWG, we embraced the use of highly portable Docker images for encapsulating and sharing complex alignment and variant calling workflows across highly variable environments. While successful, this endeavor revealed a limitation in Docker containers, namely the lack of a standardized way to describe and execute the tools encapsulated inside the container. As a result, we created the Dockstore (https://dockstore.org), a project that brings together Docker images with standardized, machine-readable ways of describing and running the tools contained within. This service greatly improves the sharing and reuse of genomics tools and promotes interoperability with similar projects through emerging web service standards developed by the Global Alliance for Genomics and Health (GA4GH).

4.
J Biol Eng ; 8: 16, 2014.
Article in English | MEDLINE | ID: mdl-25057291

ABSTRACT

We discuss fluorescence as a method to detect polycyclic aromatic hydrocarbons and other organic molecules, as well as minerals, on the surface of Mars. We present an instrument design adapted from the ChemCam instrument, which is currently on the Mars Science Laboratory rover Curiosity; most of the primary components are therefore already flight qualified for Mars surface operations, significantly reducing development costs. The major change compared to ChemCam is the addition of frequency multipliers that convert the 1064 nm laser output to wavelengths suitable for fluorescence excitation (266 nm, 355 nm, and 532 nm). We present fluorescence spectra for a variety of organics and minerals relevant to the surface of Mars. Preliminary results show that minerals already known on Mars, such as perchlorate, fluoresce most strongly when excited at 355 nm. We also demonstrate that polycyclic aromatic hydrocarbons, such as those present in Martian meteorites, are highly fluorescent when excited at ultraviolet wavelengths (266 nm, 355 nm), but less so in the visible (532 nm). We conclude that fluorescence can be an important method for Mars applications and standoff detection of organics and minerals. The instrument approach described in this paper builds on existing hardware and offers high scientific return for minimal cost for future missions.
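
The three excitation wavelengths quoted in this abstract are consistent with frequency multiplication of a 1064 nm fundamental: multiplying the optical frequency by n divides the wavelength by n. The short check below is an illustrative calculation only; the abstract does not state which harmonic orders the instrument uses, so the mapping to orders 2, 3, and 4 is an assumption, and the function name is hypothetical.

```python
FUNDAMENTAL_NM = 1064.0  # laser fundamental wavelength given in the abstract


def harmonic_wavelength_nm(order, fundamental_nm=FUNDAMENTAL_NM):
    """Wavelength of the n-th harmonic: frequency times n means wavelength over n."""
    if order < 1:
        raise ValueError("harmonic order must be a positive integer")
    return fundamental_nm / order


# Assumed harmonic orders; rounding to the nearest nanometre matches the
# nominal values listed in the abstract (532, 355, and 266 nm).
for order in (2, 3, 4):
    print(order, round(harmonic_wavelength_nm(order)))
```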
